New Technology of Library and Information Service, 2014, Vol. 30, Issue 7: 24-33     https://doi.org/10.11925/infotech.1003-3513.2014.07.04
Digital Library
An Algorithm of Digital Resources Text Categorization for Training Sets Skewed Distribution
Li Xiangdong (1,2), He Haihong (1), Cao Huan (1), Huang Li (3)
1. School of Information Management, Wuhan University, Wuhan 430072, China;
2. Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3. Wuhan University Library, Wuhan 430072, China
Abstract

[Objective] To improve the classification performance of digital resource texts under a scientific classification system by correcting the imbalanced distribution of the training set. [Methods] This paper proposes B-LDA, a new method that combines granule partitioning with LDA. The method first splits the training set according to a partition criterion, thereby transforming its granularity space. It then models the texts with a probabilistic topic model (LDA) and generates new texts from each class's global semantic information until the training set is balanced. [Results] Simulation experiments show that, as the number of feature terms varies, the F1 value improves by 2.7% to 9.9% on training sets with different degrees of skew. [Limitations] Because the corpus is limited in size, the training sets constructed for the experiments cover only some skew conditions. In addition, the separability of the two randomly selected categories affects the classification performance of the new method. [Conclusions] The method effectively improves classification performance on skewed training sets whose texts are digital resources such as book bibliographic records, journal title records, and Web pages.
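The generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' B-LDA implementation: it omits the granule partition step and the fitting of the topic model, and simply assumes that a per-class topic mixture (`theta_by_class`) and per-topic word distributions (`phi`) are already available. All names and the toy values below are hypothetical, chosen only to show how sampling from LDA's generative process can top up a rare class until the class counts match.

```python
import random
from collections import Counter

def sample_document(theta, phi, length, rng):
    """Draw one synthetic document from LDA's generative process.

    theta: topic mixture for a class (probabilities over topics)
    phi:   per-topic word distributions (list of {word: probability} dicts)
    """
    topics = list(range(len(theta)))
    words = []
    for _ in range(length):
        z = rng.choices(topics, weights=theta)[0]           # choose a topic
        vocab, probs = zip(*phi[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])  # choose a word from that topic
    return words

def balance_with_generated(train, theta_by_class, phi, doc_length=8, seed=0):
    """Append generated documents to every class smaller than the largest class."""
    rng = random.Random(seed)
    counts = Counter(label for label, _ in train)
    target = max(counts.values())
    balanced = list(train)
    for label, n in counts.items():
        for _ in range(target - n):
            doc = sample_document(theta_by_class[label], phi, doc_length, rng)
            balanced.append((label, doc))
    return balanced

# Toy parameters (assumed, not estimated from any corpus).
phi = [
    {"library": 0.5, "catalog": 0.3, "book": 0.2},
    {"network": 0.6, "protocol": 0.4},
]
theta_by_class = {"rare": [0.9, 0.1], "common": [0.2, 0.8]}

# Skewed training set: five "common" documents and one "rare" document.
train = [("common", ["network", "protocol"])] * 5 + [("rare", ["library", "book"])]
balanced = balance_with_generated(train, theta_by_class, phi)
```

In practice `theta_by_class` and `phi` would come from a topic model fitted on the training texts, and the generated documents would be passed through the same feature selection as the originals before the classifier is trained.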

Key words: Skewed distribution; Granule partitions; Probabilistic topic models; Text categorization; Digital resources
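Received: 2014-03-09      Published: 2014-10-20
CLC number: TP391
Corresponding author: Huang Li, E-mail: huangcomplete@gmail.com
Author contributions: Li Xiangdong proposed the topic and research approach and revised the final version; He Haihong collected the data, ran the experiments, and wrote the paper; Cao Huan designed the research plan, analyzed the data, and drafted the paper; Huang Li analyzed the data and revised the final version.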
Cite this article:
Li Xiangdong, He Haihong, Cao Huan, Huang Li. An Algorithm of Digital Resources Text Categorization for Training Sets Skewed Distribution. New Technology of Library and Information Service, 2014, 30(7): 24-33.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.07.04      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2014/V30/I7/24


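Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing; Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn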